Recombination Events in the History of a Sample of DNA Sequences
Abstract
Some statistical properties of samples of DNA sequences are studied under an infinite-site neutral model with recombination. The two quantities of interest are R, the number of recombination events in the history of a sample of sequences, and RM, the number of recombination events that can be parsimoniously inferred from a sample of sequences. Formulas are derived for the mean and variance of R. In contrast to R, RM can be determined from the sample. Since no formulas are known for the mean and variance of RM, they are estimated with Monte Carlo simulations. It is found that RM is often much less than R; therefore, the number of recombination events may be greatly underestimated in a parsimonious reconstruction of the history of a sample. The statistic RM can be used to estimate the product of the recombination rate and the population size or, if the recombination rate is known, to estimate the population size. To illustrate this, DNA sequences from the Adh region of Drosophila melanogaster are used to estimate the effective population size of this species.
The neutral infinite-site model introduced by KIMURA (1971) is a natural framework for analyzing nucleotide sequence data. Much of the analytical development of this model has been for the cases in which the rate of recombination is zero or infinite (WATTERSON 1975; EWENS 1979). Recently, HUDSON (1983b) has studied some properties of this model when the rate of intragenic recombination is finite. Using a result of GRIFFITHS (1981) for a two-locus model with finite recombination, he derived a formula for the variance of the number of segregating sites in a sample of size 2 and obtained an approximation for the expected homozygosity. HUDSON (1983b) has also developed an efficient method for simulating samples from the neutral infinite-site model with finite recombination. His method generates the "history" of a sample.
Genetics 111: 147-164, September, 1985. R. R. HUDSON AND N. L. KAPLAN
The history of a sample is a collection of correlated family trees, one for each site (for DNA sequence data, each nucleotide is considered a site). The family tree for a site traces the genealogy of the site back to its most recent common ancestor, indicating which sampled gametes are most closely related and when the most recent common ancestors occurred. If the rate of recombination is zero, then each site has the same family tree and therefore the history of the sample consists of just one tree. The method for generating this tree depends on results of WATTERSON (1975) (for details see HUDSON 1983a; TAJIMA 1983). On the other hand, if the recombination rate is infinite, then all of the family trees are independent of each other, and each family tree is generated in the same way as when the recombination rate is zero. If the recombination rate is finite, then the topologies and lengths of the branches of the family trees are correlated because of linkage, and generating them is more complex but still possible (HUDSON 1983b). Let generation t (t ≥ 0) denote the population t generations before the present one, from which the sample is taken. Those gametes in generation t that have descendants in the sample are referred to as the ancestral gametes in generation t. For any site the number of ancestral gametes in generation t is just the number of branches t generations before the present one in the family tree of that site. If an ancestral gamete in generation t − 1 is the recombinant descendant of two ancestral gametes in generation t, then we say that a recombination event has occurred in generation t. Let R denote the total number of recombination events in the history of the sample. The object of this paper is to study the statistical properties of R. If the rate of recombination is zero, then R = 0, and if the rate of recombination is infinite, then R = ∞.
Thus, R is interesting only when the recombination rate is finite and nonzero. For this case formulas are derived for the mean and variance of R for arbitrary sample size. Although R is a quantity of interest from a theoretical point of view, its drawback is that it cannot be evaluated from data, since the history of a sample is never observed.
One way to infer that at least one recombination event took place between two sites in the history of the sample is to use the “four-gamete” test. This test can be explained in the following way. For the infinite-site model the mutation rate for any site is infinitesimal; therefore, at most one mutation event can occur in the history of the sample at that site. Thus, for any two sites there are at most four gametic types in the population. Furthermore, since the model does not allow for back mutation and recurrent mutation, the only way for all four gametic types to be in the sample is for at least one recombination event to have occurred in the history of the sample between the two sites. Not all recombination events in the history of the sample are revealed by the four-gamete test. For a recombination event to be detected by this test, the history of sampled gametes must have a specific structure and mutations must occur on appropriate lineages of the family trees. It is shown that even for extremely high mutation rates and moderate sample sizes a substantial fraction of the recombination events in the history of the sample can never be detected using the four-gamete test.
Let RM denote the minimum number of recombination events implied by the data using the four-gamete test (see APPENDIX 2). The statistical properties of RM are of interest since this quantity arises naturally when one attempts to actually construct the history of the sample. Furthermore, RM may be useful in estimating the rate of recombination.
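
The four-gamete test and the lower bound it implies can be illustrated with a short Python sketch. This is not the procedure of APPENDIX 2, which is not reproduced here. It assumes that each sampled gamete is a string of 0/1 alleles (ancestral/derived) over the segregating sites, and that the minimum number of events equals the largest number of non-overlapping site intervals that each show all four gametic types. The function names and the greedy interval selection are illustrative choices.

```python
from itertools import combinations


def four_gametes(haplotypes, i, j):
    """Return True if all four gametic types occur at sites i and j.

    Assumes binary alleles ('0'/'1'), as under the infinite-site model.
    """
    return len({(h[i], h[j]) for h in haplotypes}) == 4


def incompatible_intervals(haplotypes):
    """List every site pair (i, j), i < j, that shows all four gametic types."""
    n_sites = len(haplotypes[0])
    return [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if four_gametes(haplotypes, i, j)
    ]


def min_recombinations(haplotypes):
    """Lower bound on recombination events implied by the four-gamete test.

    Each incompatible pair (i, j) requires at least one event somewhere
    between site i and site j. Intervals that share at most an endpoint
    site need separate events, so the bound is the maximum number of such
    non-overlapping intervals, found greedily by right endpoint.
    """
    intervals = sorted(incompatible_intervals(haplotypes), key=lambda ij: ij[1])
    count = 0
    last_right = 0  # the next chosen interval must start at or after this site
    for i, j in intervals:
        if i >= last_right:
            count += 1
            last_right = j
    return count


# Hypothetical sample of four gametes typed at three segregating sites.
sample = ["000", "011", "101", "110"]
print(incompatible_intervals(sample))
print(min_recombinations(sample))
```

In this made-up sample every pair of sites shows all four gametic types, so at least one event is needed between sites 0 and 1 and another between sites 1 and 2. A sample such as "00", "01", "11" shows only three gametic types and gives a bound of zero, even though recombination events may still have occurred in its history.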
The mean and variance of RM are complicated functions of the mutation and recombination rates. It is shown that E(RM) is an increasing function of the mutation rate, and the limiting value of E(RM) is identified. The statistical properties of R and RM for different rates of mutation and recombination are also examined using the simulation methods of HUDSON (1983b). Finally, the results in this paper are discussed in light of the recent data set published by KREITMAN (1983). In particular, the effective population size of Drosophila melanogaster is estimated from RM.
STATISTICAL PROPERTIES OF R
Let 2N denote the population size, which is assumed to be fixed, c the rate of recombination per generation per gamete, u the rate of mutation per generation per gamete and n the sample size. Both c and u are assumed to be of order 1/N; therefore, it is convenient to define θ = 4Nu and C = 4Nc. For simplicity the chromosome under study is represented by the interval [0, 1]. Suppose that for any integer, m, the genome is divided into m equal segments which are labeled from 1 to m starting from the left. One then has the identity